Text File | 1991-06-09 | 8KB | 209 lines
NAME
smooth - split linear smoothing
SYNOPSIS
smooth [file] [options]
USAGE
By default, SMOOTH reads pairs of numbers (x- and y-values)
from the standard input (or the given file), fits a smooth
curve to the points, and writes to the standard output points
from the smooth curve.
Two smoothing algorithms are available. By default, the curve
is calculated using the "lowess" procedure developed by W. S.
Cleveland (see below). This technique achieves robustness by
decreasing the weights on data points that are far from the
fitted line. An alternative procedure due to Art Owen is also
provided (with the -s switch). This technique smooths the data
while preserving sharp discontinuities in slope or value.
As with GRAPH, each pair of values may optionally be followed
by a comment (label). If the comment is surrounded by quotes
"...", it may contain spaces. The given points, and their
comments if any, will be included in the output. The
smoothing may optionally be restarted after each label, so
that a family of curves may be processed together (see the -b
switch).
Input lines starting with ";" are copied to the beginning of
the output file but are otherwise ignored. Blank lines are
ignored.
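The input rules above can be sketched in Python. This is only an
illustration, not the program's own parser: the function name
parse_input is hypothetical, and the use of shlex for quote handling
is an assumption (it would, for instance, trip over an unquoted
comment containing an apostrophe).

```python
import shlex

def parse_input(lines):
    """Split SMOOTH-style input into header lines and data points.

    Returns (headers, points); each point is (x, y, comment_or_None).
    """
    headers, points = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue                      # blank lines are ignored
        if stripped.startswith(";"):
            headers.append(stripped)      # to be copied to the start of the output
            continue
        fields = shlex.split(stripped)    # keeps "quoted text" as one field
        x, y = float(fields[0]), float(fields[1])
        comment = fields[2] if len(fields) > 2 else None
        points.append((x, y, comment))
    return headers, points
```

For example, the line 1 2 "peak value" would yield the point
(1.0, 2.0, 'peak value').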
OPTIONS
Options can appear anywhere on the command line.
-a [step [start]]  automatic abscissas
-b                 break smooth after each label
-c                 general curve
-f <num>           for "lowess", the fraction of points to use for
                   each fitted value (default .5)
-n <num>           for "lowess", the number of points to use for
                   each fitted value (default 50% of the points)
-r                 print residuals rather than smoothed values
-s                 split linear fit rather than "lowess"
-xl                take logs of x values before smoothing
-yl                take logs of y values before smoothing
-zl                take logs of z values before smoothing
                   (implies -3)
-3                 3D case: x, y, and z given for each point
If the -c switch is not used, the input points must be from a
function - that is, the x values must be strictly increasing.
The output points will also be from a function. (If the -b
switch is used, this restriction applies only within each
segment.)
If the -c switch is used (indicating a general curve), the
input points need not be from a function, but each point
must be separated from the previous point by a nonzero
distance. (If the -b switch is used, this restriction applies
only within each segment.)
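The two restrictions can be checked before smoothing. The following
Python sketch is an illustration only; the function name and the
(x, y, label) tuple layout are hypothetical, and it assumes that a
label marks the last point of a segment when -b is in effect.

```python
def check_points(points, general_curve=False, break_on_label=False):
    """Return the index of the first offending point, or None if all pass.

    points: list of (x, y, label_or_None) tuples.
    """
    prev = None
    for i, (x, y, label) in enumerate(points):
        if prev is not None:
            if general_curve:
                ok = (x, y) != prev       # nonzero distance from previous point
            else:
                ok = x > prev[0]          # x must be strictly increasing
            if not ok:
                return i
        # with -b, a label ends the segment, so the next point starts afresh
        prev = None if (break_on_label and label is not None) else (x, y)
    return None
```

A repeated x value is rejected for a function but accepted for a
general curve, as long as the y value differs.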
The -f and -n switches designate the number of data points used to
calculate a given smoothed value. The larger the number, the
smoother the resulting curve. It is not possible to specify
the number in terms of a range of the independent variable
(e.g. a "time constant"). Therefore, these methods are
appropriate when the density of data points is approximately
constant, or else the density is higher in the "interesting"
(i.e. rapidly changing) part of the curve.
The distinction between the -f and -n switches becomes useful
only when there are several data sets. Suppose one had two
data sets for the same range of independent variables, and that
one set had twice the number of data points as the other. For
equivalent treatment, one could smooth the two sets with the
same value for the -f switch.
On the other hand, suppose two sets of data have data points at
the same density, but that one set covered twice the range of
independent variable (and therefore had twice as many data
points). For equivalent smoothing, one could use the -n switch
with the same value in each case.
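The relation between the two switches can be put in a few lines of
Python. This is a sketch under assumptions: the function name is
hypothetical, and the rounding rule and the minimum of two points are
guesses, not taken from the program.

```python
def points_per_fit(n_points, f=None, n=None):
    """Number of data points used for each fitted value.

    n mimics the -n switch (an absolute count); f mimics the -f switch
    (a fraction of the data set). With neither given, f defaults to .5.
    """
    if n is not None:
        return min(n, n_points)
    fraction = 0.5 if f is None else f
    return max(2, round(fraction * n_points))   # rounding is an assumption

# Same x range, one set twice as dense: the same -f covers the same
# share of the range in both sets.
print(points_per_fit(100, f=0.2), points_per_fit(200, f=0.2))

# Same density, one set twice as long: the same -n covers the same
# stretch of x in both sets.
print(points_per_fit(100, n=20), points_per_fit(200, n=20))
```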
For general curves, the given x- and y- (and z-, if present) points
are regarded as functions of the distance along a smoothed path. This
doesn't work very well for split linear smoothing, since it tends to
conceal abrupt changes in position. However, the split linear smooth
is still able to preserve abrupt changes in the first derivative.
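To illustrate the idea of treating the coordinates as functions of
distance along a path, the Python sketch below computes the
cumulative distance along the straight segments joining the given
points. It is a simplification: the text above refers to distance
along a smoothed path, and how the program obtains that path is not
described here. The function name is hypothetical.

```python
import math

def path_distance(points):
    """Cumulative straight-line distance along (x, y) or (x, y, z) points."""
    s = [0.0]
    for p, q in zip(points, points[1:]):
        s.append(s[-1] + math.dist(p, q))   # length of the segment p -> q
    return s

pts = [(0, 0), (3, 4), (3, 10)]
s = path_distance(pts)
# x and y can now each be smoothed as a function of s, which is
# strictly increasing as long as consecutive points are distinct.
print(s)
```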
METHODS
Lowess by W. S. Cleveland, and split linear fit by A. Owen
Lowess
Robust locally weighted regression is a method for
smoothing a scatterplot, (x[i], y[i]), i=1,...,n, in
which the fitted value at x[k] is the value of a
polynomial fit to the data using weighted least
squares, where the weight for (x[i], y[i]) is large if
x[i] is close to x[k]. Robustness is added by
calculating residuals and repeating the procedure with
reduced weights on points with large residuals.
Reference:
W. S. Cleveland, "Robust Locally Weighted Regression
and Smoothing Scatterplots", Journal of the American
Statistical Association, v74, n368, p829 (Dec 79)
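The procedure described above (widely known as LOWESS) can be
sketched in Python as follows. This is a minimal illustration, not
the code used by SMOOTH. It assumes a local polynomial of degree 1,
strictly increasing x values, and the tricube and bisquare weight
functions usually associated with Cleveland's paper; whether the
program uses exactly these choices is not stated here.

```python
from statistics import median

def weighted_line_at(x, y, w, x0):
    """Value at x0 of the weighted least-squares line; None if all weights are 0."""
    sw = sum(w)
    if sw == 0:
        return None
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    if sxx == 0:
        return my                       # one effective point: constant fit
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    return my + sxy / sxx * (x0 - mx)

def lowess(x, y, f=0.5, iterations=2):
    """Robust locally weighted linear fit; x must be strictly increasing."""
    n = len(x)
    k = max(2, min(n, round(f * n)))    # points used for each fitted value
    robust = [1.0] * n                  # robustness weights, all 1 at first
    fitted = list(y)
    for _ in range(iterations + 1):
        for i in range(n):
            # distance to the k-th nearest point sets the window half-width
            h = sorted(abs(xj - x[i]) for xj in x)[k - 1]
            w = []
            for j in range(n):
                u = abs(x[j] - x[i]) / h
                # tricube weight: large if x[j] is close to x[i]
                w.append(robust[j] * (1 - u ** 3) ** 3 if u < 1 else 0.0)
            value = weighted_line_at(x, y, w, x[i])
            if value is not None:
                fitted[i] = value
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = median(abs(r) for r in resid)
        if s == 0:
            break
        # bisquare weight: reduced for points with large residuals
        robust = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                  for r in resid]
    return fitted
```

Each fitted value examines every data point, which is consistent
with a running time that grows roughly as the square of the number
of points.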
Split Linear Smoothing Algorithm
Given:
A list of window sizes, SizeList, and n pairs (x[i],y[i]) sorted on x,
Returns:
the split linear smooth of y on x.
The general technique is due to Art Owen, who offers this
discussion:
"You should feel free to experiment with the
algorithm, since it has some ad hoc parts. The
essentials are: to use uncentered windows of
varying sizes along with the central ones, to
get zero weight on the worst fitting lines, and
to make the weight attached to a particular
line size and orientation vary smoothly as one
traverses the data. We tried to find a simple
way to meet all of these goals; the algorithm
we settled on was the simplest that worked for
us. ...
"West and Chan et al. are useful for getting
numerically stable updating formulae for the
regressions."
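Two of the essentials named in the quotation (uncentered windows of
varying sizes alongside central ones, and zero weight on the worst
fitting line) can be shown with a toy in Python. This is emphatically
not the McDonald-Owen algorithm and not the code in SMOOTH: the
weighting rule (worst error minus own error) and the window sizes
are invented for illustration, and the weights are not made to vary
smoothly along the data.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (predict function, mean squared residual)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx if sxx else 0.0
    predict = lambda x0: my + slope * (x0 - mx)
    mse = sum((b - predict(a)) ** 2 for a, b in zip(xs, ys)) / n
    return predict, mse

def split_linear_toy(x, y, size_list=(4, 8)):
    """Blend left, right and central window fits at each point (toy version)."""
    n = len(x)
    out = []
    for i in range(n):
        candidates = []                       # (prediction at x[i], fit error)
        for k in size_list:
            for lo, hi in ((i - k, i), (i, i + k), (i - k // 2, i + k // 2)):
                lo, hi = max(lo, 0), min(hi, n - 1)
                if hi - lo < 2:
                    continue                  # need at least three points
                predict, mse = fit_line(x[lo:hi + 1], y[lo:hi + 1])
                candidates.append((predict(x[i]), mse))
        worst = max(m for _, m in candidates)
        weights = [worst - m for _, m in candidates]   # worst line gets zero
        total = sum(weights)
        if total > 0:
            out.append(sum(w * p for w, (p, _) in zip(weights, candidates)) / total)
        else:                                 # all lines fit equally well
            out.append(sum(p for p, _ in candidates) / len(candidates))
    return out
```

On data such as y = |x|, the one-sided windows fit exactly at the
corner while the central windows do not, so the blended value stays
much closer to the corner than a central fit alone would.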
References:
John Alan McDonald and Art B. Owen, "Smoothing with
Split Linear Fits", LCS Technical Report No.
7, SLAC-PUB 3423, AD-A149032, Laboratory for
Computational Statistics, Dept. of Statistics,
Stanford University, July 1984.
D. H. D. West, "Updating Mean and Variance
Estimates: An Improved Method", Communications
of the ACM, v22, n9, p532-535 (1979).
T. F. Chan, G. H. Golub, and R. J. LeVeque,
"Algorithms for Computing the Sample Variance:
Analysis and Recommendations", The American
Statistician, v37, p242-247 (1983).
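For readers unfamiliar with updating formulae, the Python sketch
below shows a one-pass weighted update of mean and variance of the
kind discussed in the West and Chan et al. papers. It is offered
only as an illustration of the idea; the function name is
hypothetical, and positive weights are assumed.

```python
def running_mean_variance(values, weights=None):
    """One-pass weighted mean and (population) variance."""
    if weights is None:
        weights = [1.0] * len(values)
    sum_w = 0.0      # total weight seen so far
    mean = 0.0       # running weighted mean
    m2 = 0.0         # running weighted sum of squared deviations
    for x, w in zip(values, weights):
        new_sum = sum_w + w
        delta = x - mean
        r = delta * w / new_sum
        mean += r
        m2 += sum_w * delta * r
        sum_w = new_sum
    return mean, m2 / sum_w

print(running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9]))
```

The point of such formulae is that a value can be added to a window
without recomputing the sums from scratch, and without the loss of
accuracy that comes from subtracting two large sums of squares.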
IMPLEMENTATION
The implementation of the split linear smoothing is based on
pseudocode by Art Owen.
The arrays take a lot of space. For n points, the number of
doubles is approximately 38*n, plus 2*n for a general curve, plus
2*n for the 3D case. For 100 points and 8-byte doubles, this means
at least 8*38*100 = 30400 bytes.
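The arithmetic above can be written as a small Python function for
other cases. The function name and parameter names are hypothetical;
the figures 38, 2 and 2 are the approximate ones given above.

```python
def estimated_bytes(n, general_curve=False, three_d=False, bytes_per_double=8):
    """Approximate array storage for n points, per the figures above."""
    doubles = 38 * n
    if general_curve:
        doubles += 2 * n      # extra arrays for a general curve (-c)
    if three_d:
        doubles += 2 * n      # extra arrays for the 3D case (-3)
    return doubles * bytes_per_double

print(estimated_bytes(100))                                     # 30400
print(estimated_bytes(100, general_curve=True, three_d=True))   # 33600
```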
Execution time: the program will use a numeric
coprocessor if one is available, but will run correctly without
one. Time for "lowess" is proportional to the square of the
number of data points. 101 points took 151 seconds on a 7.5
MHz V-20 with no 8087, but only 0.98 seconds on a 20 MHz 80386
with an 80387. Time for split linear smoothing increases
slightly faster than linearly in the number of data points.
The updating formulas mentioned by Art Owen are not used in
this program. The selection of window sizes (a geometric
sequence) is my own. -JVZ
EXAMPLES
The file ROUGH contains data points from sin(x) with one abrupt
phase reversal (creating a discontinuity) and some added noise.
To see the effect of the two algorithms, try
C>smooth rough -f .2 >rlow
C>smooth rough -s >rsl
Then display all three files with GRAPH...
C>graph rough rlow rsl -m -32 10 20
Note how the split linear smooth preserved the discontinuity
whereas "lowess" smoothed it out somewhat.
The file SP contains points from a general curve...
C>smooth sp -f .2 -c >splow
C>smooth sp -s -c >spsl
C>graph sp splow spsl -m 1 10 20
This input file has a discontinuity in the first derivative
which the split linear smooth was able to preserve.
AUTHOR
Copyright (c) 1987, 1991 by James R. Van Zandt
(jrv@mbunix.mitre.org) 27 Spencer Dr., Nashua NH 03062,
603-888-2272. Resale forbidden, copying for personal use
encouraged. Constructive comments welcome.